Examination of a Wine-Based Dataset by Kellie English

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

This dataset contains information about the quality of different variations of red wine. There are 1599 observations and 13 variables; the first, X, is simply a row index, which leaves 11 measured properties and a quality rating.

Since I have little knowledge of wine, I researched the variables in relation to their importance in wine quality:

Fixed acidity relates to the sourness of wines: wines made from grapes grown in cooler climates are higher in fixed acidity and taste more sour, while wines from grapes grown in warmer climates are lower in acidity and therefore milder. http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity

Volatile acidity refers mainly to acetic acid, the acid found in vinegar. It is usually present in wine only at low levels.
https://en.wikipedia.org/wiki/Acids_in_wine

Citric acid is present in grapes, and is seen as affecting the ‘fresh’ taste in many wines. It occurs more frequently in white and rosé wines than in reds, so I would expect to see lower values of citric acid in this dataset. https://www.winefrog.com/definition/243/citric-acid

Residual sugar is the sugar content of the wine, which balances the acidity. https://drinks.seriouseats.com/2013/04/wine-jargon-what-is-residual-sugar-riesling-fermentation-steven-grubbs.html

Chlorides contribute to the saltiness of the wine, and are derived from the soil in which the grapes are grown. http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0101-20612015000100095

Free sulfur dioxide is the portion of the sulfur dioxide in the wine that is not bound to other compounds, while total sulfur dioxide is the free and bound portions together, including the sulfites added by the winemaker to prevent the wine from going bad. Red wines usually have less added sulfites, so I expected these numbers to be fairly similar. https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/ https://www.practicalwinery.com/janfeb09/page5.htm

Density of wine is determined by the concentration of “…alcohol, sugar, glycerol, and other dissolved solids.” https://www.etslabs.com/analyses/DEN

pH is a very good indicator of a wine’s quality. http://winemakersacademy.com/importance-ph-wine-making/

Sulphates are used to preserve the flavor and freshness of wine. https://www.scientificamerican.com/article/myths-about-sulfites-and-wine/

Alcohol is the alcohol content of the wine. Most reds are between 12 and 15%. http://winefolly.com/tutorial/alcohol-content-in-wine/

Quality is a rating of each wine; in this dataset it ranges from 3 to 8.

Univariate Plots Section

Here we will conduct a preliminary exploration of the dataset.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

This plot shows fixed acidity for the entire dataset, with the majority of wines clustering between about 7 and 10.

This is volatile acidity, with binwidth set at 0.1. Again, there is a large cluster between about 0.3 and 0.7, with a few outliers higher and lower.
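A histogram like this could be drawn with ggplot2. That is an assumption, since the plotting code is not shown in this report. The sketch below uses only the first ten volatile.acidity values printed in the str() output above, held in a hypothetical data frame called wine_head, so its shape will not match the plot for the full dataset.

```r
library(ggplot2)

# First ten volatile.acidity values, copied from the str() output above.
# `wine_head` is a hypothetical stand-in for the full data frame.
wine_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
)

# Histogram using the bin width of 0.1 mentioned in the text
p_volatile <- ggplot(wine_head, aes(x = volatile.acidity)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Volatile acidity", y = "Number of wines")

print(p_volatile)
```

A smaller binwidth would show more detail but also more noise; 0.1 is coarse enough to make the main cluster easy to see.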

This plot shows citric acid in the dataset. It mimics the first two graphs except for the large number of wines that have a very low level of citric acid, which is what we would expect from a dataset made up of red wines, since they tend to be lower in citric acid than whites.

A plot of residual sugars, most clustering between 1 and 3, with the most frequently occurring value around 2.

In this plot of chlorides we see our first legitimate outlier, hovering just above 0.6, with most of the data resting at or below 0.1.
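One common rule of thumb for calling a point an outlier is Tukey's fences, which flag values more than 1.5 times the interquartile range above the third quartile. That rule is an assumption here, not necessarily how the plot was read. A quick check using the chlorides quartiles from the summary above:

```r
# Quartiles and maximum for chlorides, copied from the summary() output above
q1 <- 0.070
q3 <- 0.090
max_chlorides <- 0.611

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # Tukey's upper fence

# TRUE means the maximum lies beyond the fence
is_outlier <- max_chlorides > upper_fence

print(c(upper_fence = upper_fence, max = max_chlorides))
print(is_outlier)
```

The fence works out to 0.12, so the maximum of 0.611 sits roughly five times higher than the cutoff.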

A plot of free sulfur dioxides. Most wines look as though they’re fairly low in these, with a few notable exceptions above 60.

The plot of total sulfur dioxide mimics the free sulfur dioxide plot, and rightly so: free sulfur dioxide is counted as part of total sulfur dioxide. The outliers here are far higher than in the free sulfur dioxide plot, however.
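If free sulfur dioxide really is a component of the total, the free value should never exceed the total for any wine. A small sketch of that check, using only the first ten rows printed in the str() output above (the name bound_so2 is hypothetical and is not a column in the dataset):

```r
# First ten rows of the two sulfur dioxide columns, copied from str() above
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# If free is part of total, this should be TRUE
free_within_total <- all(free_so2 <= total_so2)
print(free_within_total)

# The remainder is the part of the total that is not free
bound_so2 <- total_so2 - free_so2
print(bound_so2)
```

Running the same comparison on the full data frame would confirm whether the relationship holds for all 1599 wines.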

Density is fairly similar for all of the wines in this dataset- the binwidth is set to 0.0001 in order to see some differentiation here. We can assume that perhaps density is very similar for all red wines.

pH is often described as an indicator of wine quality, so this is a variable that we will work with later on. Note the roughly normal distribution.

Sulphates reflect patterns similar to the sulfur dioxide and acidity charts: a large cluster at the low end of the range, with a few outliers at higher values.

Alcohol: most wines fall between about 9% and 11% (the middle half of the data spans 9.5% to 11.1%), with a few exceptions reaching as high as 14.9%.

Quality: most of the wines are mid-range, with a quality of 5 or 6.
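The per-quality counts reported in the grouped summary in the Bivariate Plots Section of this report can put a number on this. A small sketch (the variable names are illustrative):

```r
# Quality levels and counts, copied from the grouped summary later in this report
quality <- 3:8
n       <- c(10, 53, 681, 638, 199, 18)

total        <- sum(n)                                 # number of observations
share_mid    <- sum(n[quality %in% c(5, 6)]) / total   # proportion rated 5 or 6
mean_quality <- sum(quality * n) / total               # count-weighted mean rating

print(total)
print(round(share_mid, 3))
print(round(mean_quality, 3))
```

The counts add up to the 1599 observations, about 82% of the wines are rated 5 or 6, and the weighted mean agrees with the mean quality of 5.636 in the summary output.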

Univariate Analysis

What is the structure of your dataset?

The dataset has one row per wine (1599 in total), each with a numerical value for 13 variables. There are more mid-quality wines than higher or lower quality.

What is/are the main feature(s) of interest in your dataset?

Quality of wine is most important to anyone trying to make an informed purchase- therefore, quality should be included in the analysis of this dataset.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Features affecting the flavor of the wine, such as citric acid and residual sugars. In addition, acidity and pH should affect the quality of the wine, so these should be examined as well.

Did you create any new variables from existing variables in the dataset?

No new variables have been created in the dataset.

No operations were performed to tidy the data.

Bivariate Plots Section

## Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  4 variables:
##  $ quality  : int  3 4 5 6 7 8
##  $ mean_pH  : num  3.4 3.38 3.3 3.32 3.29 ...
##  $ median_pH: num  3.39 3.37 3.3 3.32 3.28 3.23
##  $ n        : int  10 53 681 638 199 18

Using dplyr, I created a summary table called wine.pH that groups the wines by quality and reports the mean pH, the median pH, and the number of wines (n) at each quality level.
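The code that produced this table is not shown, so the following is only a sketch of how such a grouped summary could be built with dplyr. It uses the first ten rows of pH and quality from the str() output earlier in this report, held in a hypothetical data frame wine_head; the column names mean_pH, median_pH and n are taken from the table above.

```r
library(dplyr)

# First ten rows of quality and pH, copied from the str() output.
# `wine_head` is a hypothetical stand-in for the full data frame.
wine_head <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# One row per quality level with the mean, median and count
wine.pH <- wine_head %>%
  group_by(quality) %>%
  summarise(
    mean_pH   = mean(pH),
    median_pH = median(pH),
    n         = n()
  )

print(wine.pH)
```

With only ten rows the result has three quality levels rather than six; applied to the full data it should give one row for each rating from 3 to 8.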

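As we can see, the mean pH is somewhat higher in lower quality wine: about 3.40 at quality 3, falling to about 3.29 at quality 7. Since a higher pH means a less acidic wine, this is not the vinegary sourness of a wine that has turned; that would show up in volatile acidity instead. The lowest and highest quality groups also contain only 10 and 18 wines, so their averages should be read with caution.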
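Because pH is defined as the negative base-10 logarithm of the hydrogen ion concentration, a higher pH means a less acidic wine. A rough sketch of what the gap between the lowest and highest quality groups amounts to, using the median_pH values from the table above (the variable names are illustrative):

```r
# Median pH for the lowest (3) and highest (8) quality groups, from the table above
median_pH_q3 <- 3.39
median_pH_q8 <- 3.23

# pH = -log10([H+]), so [H+] = 10^(-pH)
h_q3 <- 10^(-median_pH_q3)
h_q8 <- 10^(-median_pH_q8)

# Hydrogen ion concentration at the quality-8 median relative to the quality-3 median
acidity_ratio <- h_q8 / h_q3
print(round(acidity_ratio, 2))
```

The ratio comes out a little under 1.5, so the typical quality-8 wine here is modestly more acidic than the typical quality-3 wine, not less. With so few wines in either group this is only indicative.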

Here we can see again the relationship between pH and quality. Despite the graph that we created before with the mean pH levels, it becomes apparent here that pH for low-quality or high-quality wines can occur almost anywhere on the spectrum. Mid-quality wines do cluster together, around 5 and 6, with a pH between about 3.1 and 3.5.

Here we compare quality and volatile acidity. There does appear to be a trend, with the highest volatile acidity found in the lower quality wines.

Here is a comparison between residual sugars and fixed acidity. There doesn’t seem to be any correlation between these variables.

Here is a comparison between residual sugars and citric acid. Again, there doesn’t seem to be any correlation between these variables.

Here is a comparison between citric acid and fixed acidity. There is a clear positive correlation, though the highest density of wines occurs where citric acid is either 0 or very close to 0.

In this plot of total against free sulfur dioxide we see a positive correlation, which is again not unexpected. Most wines have low quantities of both total and free sulfur dioxide.

Surprisingly, fixed acidity and density appear to be positively correlated.

Alcohol content appears to have a weak positive correlation with quality: the highest quality wines all have an alcohol content around 10% or above.

Here, the higher density wines on average have lower alcohol content. There is a very weak negative correlation apparent from this graph.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality ended up being a fairly uninformative variable- the more interesting comparisons are between pH, density, and volatile acidity.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

There are several strong relationships apparent. The first is between free sulfur dioxide and total sulfur dioxide, which is to be expected: as we saw in the source linked above, free sulfur dioxide is counted within total sulfur dioxide. Another strong relationship is between density and fixed acidity. The final strongly positive relationship is between citric acid and fixed acidity.
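The strength of relationships like these can be quantified with a Pearson correlation coefficient via cor(). The sketch below is only an illustration of the call: it uses the first ten rows from the str() output earlier in this report, so the coefficients it produces are not the values for the full dataset. Density is left out because its values are truncated in that output.

```r
# First ten rows, copied from the str() output earlier in this report
fixed.acidity        <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
citric.acid          <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
free.sulfur.dioxide  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total.sulfur.dioxide <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Pearson correlation for each pair discussed above
r_acid <- cor(citric.acid, fixed.acidity, method = "pearson")
r_so2  <- cor(free.sulfur.dioxide, total.sulfur.dioxide, method = "pearson")

print(c(citric_vs_fixed = r_acid, free_vs_total = r_so2))
```

Both coefficients are positive even on this tiny sample, which matches the direction seen in the plots; the magnitudes should be taken from the full data.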

Multivariate Plots Section

We can see that there are some patterns beginning to emerge here- the higher quality wines have higher fixed acidity, and slightly lower density. Mid quality wines have higher density, and lower fixed acidity. Interestingly, the poor quality wines seem to be distributed throughout.

Here is a much weaker pattern than above, but still apparent: Mid quality wines have lower fixed acidity and lower citric acid, while higher quality wines have higher citric acid and higher fixed acidity. However, the data point with the highest citric acid also happens to be lower quality- look at 1.00 on the X axis for the yellow point.
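A plot of this kind, with two continuous variables on the axes and quality mapped to color, could be sketched with ggplot2 as follows. The plotting code is not shown in this report, so the geom, transparency and labels are assumptions, and the data are only the first ten rows from the str() output, held in a hypothetical data frame wine_head.

```r
library(ggplot2)

# First ten rows, copied from the str() output earlier in this report.
# `wine_head` is a hypothetical stand-in for the full data frame.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Citric acid on the x axis, fixed acidity on the y axis, one color per quality level
p_multi <- ggplot(wine_head,
                  aes(x = citric.acid, y = fixed.acidity, color = factor(quality))) +
  geom_point(alpha = 0.7) +
  labs(x = "Citric acid", y = "Fixed acidity", color = "Quality")

print(p_multi)
```

Converting quality to a factor gives each rating its own discrete color, which makes it easier to pick out a single point such as the one at 1.00.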

Here the only pattern is that the mid-quality wines have a wider range of total sulfur dioxides, while the higher quality wines have below about 100 total sulfur dioxides.

Here, clearly, higher density wines are poorer quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Density tended to be interesting, as it was able to differentiate between qualities of wines. Citric acid was also surprising: the higher the citric acid content, the better the quality of the wine.

Were there any interesting or surprising interactions between features?

I was surprised that quality was not clearly delineated in many of the plots. I expected much firmer stripes of color, and in many of the plots it is impossible to see any clear patterns.


Final Plots and Summary

Plot One

Description One

I chose this plot because it shows a roughly normal distribution of the data across various levels of pH. It is interesting because pH is related to quality in this dataset (on average, higher pH goes with lower quality) and, as such, its shape loosely mirrors the graph made for quality above, only reversed.

Plot Two

Description Two

I chose to use the bivariate plot of quality and alcohol content, with some variations to make the plot more readable. This plot is interesting because it shows a clear positive trend of higher alcohol content in higher quality wines, especially when examining the median line added here.
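One way to add a median line to a plot like this is ggplot2's stat_summary(). The exact code behind Plot Two is not shown, so the axis assignment, jitter settings and colors below are assumptions, and the data are only the first ten rows from the str() output, held in a hypothetical data frame wine_head.

```r
library(ggplot2)

# First ten rows of alcohol and quality, copied from the str() output.
# `wine_head` is a hypothetical stand-in for the full data frame.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Jittered points reduce overplotting; the line joins the median alcohol
# content at each quality level
p_alcohol <- ggplot(wine_head, aes(x = quality, y = alcohol)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
  stat_summary(fun = median, geom = "line", color = "blue") +
  labs(x = "Quality", y = "Alcohol (% by volume)")

print(p_alcohol)
```

Jittering only horizontally keeps the alcohol values exact while spreading out points that share the same integer quality rating.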

Plot Three

Description Three

This plot is interesting because it relates not only to the quality, but also to the taste of the wine. Anyone concerned with this dataset for consumption purposes would find this vital: the wines with higher levels of citric acid, the ones described as lighter, fruitier, and crisper, tend to be of higher quality. I’ve also added a line so that it is possible to see the mean throughout.

Reflection

Upon reflection, I am surprised to see that not many of these variables correlate with quality. For instance, I would have expected residual sugar or any of the acidities to correlate with quality, and that was not the case. Frequently while working with this data I wished that the set was much larger, so that it would be easier to see clearer trends. I was also surprised by the number of mid-quality wines that shared almost the same characteristics as the higher quality wines, which suggests that consumers who purchase very expensive bottles may be drinking a wine that is, in essence, the same as another less expensive wine.